Content attribution ignoring content / Samory, M.; Peserico, E. - (2016), pp. 233-243. (Paper presented at the 8th ACM Web Science Conference, WebSci 2016, held in Hannover, DE) [10.1145/2908131.2908156].
Content attribution ignoring content
Samory M.; Peserico E.
2016
Abstract
Can we tell the author of a message without reading the message? This work tackles authorship analysis through features that ignore the explicit content of a contribution: informally, those that can be computed even if every character in the body of a message (but not metadata such as timing or "likes") is replaced by an X. Focusing on forum posts, we distil a case-study set of these content-agnostic features and demonstrate its viability for authorship verification and attribution, using data from four online forums (of different size, language, and topic). A simple classification testbed, relying exclusively on content-agnostic features, confirms the author of a message with 76% accuracy, and discriminates between two candidate authors with 94% accuracy. Being able to re-identify a user without looking at the content of her contributions poses a serious threat to common data anonymization practices.
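The informal definition of a content-agnostic feature can be illustrated with a small sketch. It assumes a hypothetical `Post` record and an illustrative feature set (body length, posting hour, weekday, like count); none of these names or features are taken from the paper, whose actual feature set is not listed in this abstract. The point is only that such features come out the same whether or not the body has been overwritten with X characters.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class Post:
    # Hypothetical record layout; the field names are illustrative only.
    body: str
    posted_at: datetime
    likes: int


def mask_body(post: Post) -> Post:
    """Replace every character of the body with 'X', leaving metadata untouched."""
    return replace(post, body="X" * len(post.body))


def content_agnostic_features(post: Post) -> dict:
    """Illustrative features that never inspect which characters the body contains."""
    return {
        "body_length": len(post.body),
        "hour_of_day": post.posted_at.hour,
        "weekday": post.posted_at.weekday(),
        "likes": post.likes,
    }


original = Post("See you at the meetup!", datetime(2016, 5, 22, 21, 30), likes=3)
masked = mask_body(original)

# The feature vector is unchanged by masking, which is what makes it content-agnostic.
print(content_agnostic_features(original) == content_agnostic_features(masked))
```

A feature such as word count or punctuation frequency would fail this check under the masking assumed here, since it depends on which characters appear in the body.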